CSC 453

Winter 2026 Day 11

Admin

  • Midterm 2 on 2/26
  • Assignment 3 due 2/23
  • Zoom for classes / labs / office hours this week

Page replacement

Questions to consider

  • Why can adding more page frames actually increase page faults with certain algorithms?
  • How can we approximate optimal page replacement without knowing the future?
  • What property must a page replacement algorithm have to avoid Belady’s anomaly?

Page replacement algorithms: FIFO

  • First-In-First-Out (FIFO)
  • Simple: first page in (oldest) will be evicted
  • Downsides?
    • Ignores how pages are used: might throw out important, frequently used pages
  • Can lead to Belady’s anomaly: adding more frames can lead to more page faults
      • Why? Because FIFO ignores use entirely (it lacks the stack property)
      • Try the reference pattern 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5 with 3 frames and then 4 frames to see the anomaly:

Belady’s anomaly

  • Without the stack property, more frames can mean more page faults, because the algorithm ignores use entirely
  • Number of page faults with 3 frames = 9
  • Number of page faults with 4 frames = 10
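A short simulation (a sketch, not assigned code) reproduces the anomaly with the classic reference string:

```python
from collections import deque

def fifo_faults(refs, num_frames):
    """Count page faults for FIFO replacement over a reference string."""
    frames = deque()                   # oldest page at the left
    faults = 0
    for page in refs:
        if page not in frames:
            faults += 1
            if len(frames) == num_frames:
                frames.popleft()       # evict the oldest page
            frames.append(page)
    return faults

# Classic reference string that exhibits Belady's anomaly
refs = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]
print(fifo_faults(refs, 3))  # 9 faults
print(fifo_faults(refs, 4))  # 10 faults: more frames, more faults!
```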

Page replacement algorithms: LRU

  • We can’t tell the future, what should we do?
    • Look at the past
  • Least Recently Used (LRU)
    • Each page has some age with respect to use; replace page that hasn’t been used in the longest time
    • Like OPT, but only looking backwards
      • Unfortunately, there is a chance that the page you evict is needed immediately

LRU (cont’d)

  • LRU does not exhibit Belady’s anomaly
  • How can we implement LRU?
    • Counters: each entry has an associated clock (time of last use)
      • Wastes space, and finding the oldest entry requires a search
    • Ordered queue
      • Less space wasted, but a lot more memory accesses (pointers) to keep up-to-date
    • Both are impractical without specialized hardware
    • We can approximate LRU with a single bit in the page table entry: reference bit
      • Bit set when page is referenced
      • Bits set to zero initially and on context switch

Page replacement algorithms: others

  • Second Chance
    • Use a FIFO queue and a reference bit
      • If bit is 0, replace the page
      • If bit is 1, set to 0 and move to the end of the queue
    • Downside?
      • Moving pages around is expensive
  • Clock
    • Page frames in a circular list, with a clock “hand” pointing to the oldest page (the next eviction candidate)
    • When evicting, look at page pointed to by hand
      • If reference bit is 0, replace page
      • If bit is 1, set to 0, and advance hand
    • A nice approximation of LRU
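Victim selection for the clock algorithm can be sketched as follows (a toy model: lists stand in for the frame list and the per-frame reference bits):

```python
def clock_evict(frames, ref_bits, hand):
    """Clock victim selection: sweep the hand, clearing reference bits,
    until a page with bit 0 is found. Returns (victim_index, new_hand)."""
    while True:
        if ref_bits[hand] == 0:
            victim = hand
            hand = (hand + 1) % len(frames)  # hand advances past the victim
            return victim, hand
        ref_bits[hand] = 0                   # give the page a second chance
        hand = (hand + 1) % len(frames)

frames = ['A', 'B', 'C', 'D']
ref_bits = [1, 0, 1, 1]
victim, hand = clock_evict(frames, ref_bits, 0)
print(frames[victim])  # 'B': first page found with reference bit 0
```

Unlike Second Chance, no pages are moved around; only the hand advances, which is why Clock is the cheaper formulation of the same idea.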

Page replacement algorithms: others (cont’d)

  • Least Frequently Used
    • Keep / increment counter for each page
    • Evict the page with the lowest count
    • Downside?
      • A page used heavily at startup but never again keeps a high count and stays resident. The answer to this is aging (periodically decaying the counts)
  • Most Frequently Used
    • Same as above only evict the page with highest value
    • Downside?
      • The page with the smallest count may have just arrived and is about to be used
  • Both are bad ideas. They do not approximate OPT and they are expensive

Page replacement algorithms: summary

  • OPT is optimal but impossible to implement (it requires knowing the future). LRU is a good approximation.
  • Many different page replacement algorithms (many not covered here), with different performance characteristics and hardware requirements
  • e.g., Linux: LRU 2Q
    • Two FIFO lists (active and inactive), simulating the reference-bit state in software
    • No hardware requirement
  • Note that the concept of page replacement is seen at all interfaces of the memory hierarchy, just at different time scales; and, within certain applications
    • We focus on the disk/RAM interface because the difference in access time is so great

What isn’t clear?

Comments? Thoughts?

Page tables

Questions to consider

  • Why is page table size a critical concern for memory efficiency?
  • How can hardware support (like the TLB) improve address translation performance?
  • What are the advantages and disadvantages of different page table storage locations?

Page table size

  • If we have a 32-bit address space and 4KB pages, we have:
    • \(2^{20}\) pages (1M)
    • Each page table entry is typically 4 bytes
    • Total page table size = 4MB per process
    • If we have 100 processes, that’s 400MB just for page tables. Yikes
  • We’ll come back to this later. Bottom line: we do not use linear page tables (what we’ve talked about so far, e.g., the page table as an array)
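The arithmetic behind the 4MB figure, spelled out:

```python
address_bits = 32
page_size = 4 * 1024                        # 4KB pages -> 12-bit offset
pte_size = 4                                # bytes per page table entry

num_pages = 2**address_bits // page_size    # 2^32 / 2^12 = 2^20 pages
table_size = num_pages * pte_size           # 4MB per process

MB = 1024 * 1024
print(num_pages, table_size // MB)          # 1048576 4
print(100 * table_size // MB, "MB for 100 processes")  # 400 MB for 100 processes
```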

Page table size (cont’d)

  • To be efficient: we’re going to require some hardware support and clever structures
  • Why do we care?
    • Reason 1 (Speed): Translations must be done on every memory reference
    • Reason 2 (Cost): memory is getting larger:
      • 32-bit words, 4K page: 1 million entries at 4 bytes each = 4MB per process;
      • 64-bit words, 4K page: \(2^{52}\) entries at 8 bytes each = \(2^{55}\) bytes (32 PiB) per process

Page table access

  • Where should the page table be stored?
    • Old solution: registers and special hardware (not scalable)
    • Real answer: in main memory
      • PCB holds pointer to page table; context switch needs to account for this
    • See any downsides to this?
      • If the page table is large, it can consume a lot of memory
      • Requires two memory accesses for each actual memory access (one for the page table, one for the data)
      • How do we speed this up?
        • Hardware

Translation lookaside buffer (TLB)

  • Use associative memory
    • Can someone tell me what’s special about associative (a.k.a. content-addressable) memory?
      • Parallel search: All key values can be queried at once; if found, value is returned
  • Comprised of key(tag)-value pairs
    • Ultimately it is a cache for the page table
  • Expensive, so TLB only contains a small number of entries

TLB (cont’d)

  • How it works: the CPU generates a logical address (say, a process is looking to load a variable stored at address 100 in its address space); the MMU first checks the TLB for that page number
  • If it’s a miss (not in TLB); we must fetch the entry from the page table, perhaps replacing an existing entry
    • soft miss: page is in main memory
    • hard miss: page is on disk (swapped, also known as a page fault)

TLB (cont’d)

  • If it’s a hit (in TLB), we can directly get the physical address and access memory
    • TLB is fast enough that a hit results in one memory access (for the data), while a miss results in two memory accesses (one for the page table, one for the data)
  • Hit ratio determines effective access time
  • Example: 80% hit rate; 10ns to access memory
    • A hit = 10ns; a miss = 20ns
    • EAT = \((.8 * 10) + (.2 * 20) = 12ns\)
    • 99% hit? = \((.99 * 10) + (.01 * 20) = 10.1ns\)
  • What about processes that we’d like to keep in the TLB? (kernel)
    • “Wired down” i.e., cannot be replaced
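The effective access time (EAT) calculation from the slide, as a small helper. Following the example above, the TLB lookup itself is treated as negligible (set `tlb_ns` to a nonzero value to model it):

```python
def effective_access_time(hit_ratio, mem_ns, tlb_ns=0):
    """EAT: a TLB hit costs one memory access (for the data); a miss
    costs two (one for the page table entry, one for the data)."""
    hit_cost = tlb_ns + mem_ns
    miss_cost = tlb_ns + 2 * mem_ns
    return hit_ratio * hit_cost + (1 - hit_ratio) * miss_cost

print(effective_access_time(0.80, 10))  # ~12 ns
print(effective_access_time(0.99, 10))  # ~10.1 ns
```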

TLB (cont’d)

  • Wait, we have multiple processes with virtual addressing and one small TLB. What happens on a context switch?
    • E.g., virtual page 1 for process 1 != virtual page 1 for process 2
  • Could flush: empty on context switch
    • Downside?
      • You lose a lot of the goodness of caching. Needs to be rebuilt every time
  • Other idea?
    • Address-space identifiers (ASIDs) will bind TLB entries to a particular process
      • Prevents a process from accessing a page that doesn’t belong to it, while allowing pages from multiple processes to coexist in the TLB

What isn’t clear?

Comments? Thoughts?

Page table structures

Questions to consider

  • How can we organize page tables to handle large address spaces efficiently?
  • What are the trade-offs between space overhead and lookup complexity?
  • When and why would we choose alternative page table designs?

Page table structures

  • Linear page tables are wasteful, particularly if you have many unused pages (likely, since address spaces are sparse)
  • Answer?
    • Multilevel page tables

Multilevel page tables

  • Example: Two-level page table: Split the page table address into two
    • 10-bit PT1, 10-bit PT2, and 12-bit offset field
  • Top level page table references 1024 2nd tier page tables
  • Each 2nd tier page table references 1,024 4KB pages

  • Benefit?
    • 2nd tier tables need not be resident in memory if unused
    • \(2^{20}\) pages addressable, with only 4 tables resident (4 tables at 4KB each = 16KB rather than 4MB)
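Splitting a 32-bit virtual address into the 10/10/12 fields can be done with shifts and masks (a sketch; the example address is made up):

```python
def split_vaddr(vaddr):
    """Split a 32-bit virtual address into (PT1, PT2, offset)
    for a 10-bit / 10-bit / 12-bit two-level scheme."""
    offset = vaddr & 0xFFF          # low 12 bits: byte within the 4KB page
    pt2 = (vaddr >> 12) & 0x3FF     # next 10 bits: index into 2nd tier table
    pt1 = (vaddr >> 22) & 0x3FF     # top 10 bits: index into top-level table
    return pt1, pt2, offset

# Example: 0x00403004 -> top-level entry 1, 2nd tier entry 3, offset 4
print(split_vaddr(0x00403004))  # (1, 3, 4)
```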

Multilevel page tables (cont’d)

  • This is expandable to more levels
    • Note that 64-bit systems don’t use the full 64 bits
    • 48 bits allow us to address 256TB

Inverted page tables

  • What we’ve seen so far: one page table per process
  • Alternative: one page table for the entire system, with an entry for each frame of physical memory (inverted page table)
    • Each entry contains the virtual address of the page stored in that frame, along with information about the process that owns that page
    • Entry holds which pid owns the frame, and its virtual address

Inverted page tables (cont’d)

  • Reduces size, but increases lookup complexity
    • must do a full search of the page table (slow)
  • We can use a hash table to index into page table, or combine with a TLB to speed up frequent lookups
  • Not commonly used in practice
    • Once RAM became cheaper, the size of page tables became less of an issue, and the complexity of inverted page tables outweighed their benefits

What isn’t clear?

Comments? Thoughts?

Virtual memory

Questions to consider

  • How does virtual memory enable processes to use more memory than physically available?
  • What strategies can we use to determine which pages should be in memory at any given time?
  • What problems occur when memory contention is too high, and how can we detect and prevent them?

Virtual memory summary

  • Previous (naïve) discussions assumed that all of a process’s pages are present in memory when running
  • Virtual Memory is a system built around the main ideas supported by paging
  • Allows the system to present a large, logically contiguous memory to every process, while being non-contiguous in physical memory
  • Supported by pages + swap

Virtual memory summary (cont’d)

  • Many advantages
    • Processes see the same logical, contiguous address space, making programming easier
    • Logical memory may be sparse; i.e. a hole in the logical space does not (necessarily) mean a hole in the physical space
    • Pages may be shared among processes (shared libraries, shared memory or fork()’d processes)
  • Logical memory (for a single or many processes) may actually be bigger than physical memory

Memory residency

  • Not all process pages can reside in memory at once; further, we may not need all pages in memory at once
  • But… we need a page to be in main memory to be accessed
  • We’ve talked about choosing victims to replace in memory. How do we select pages to bring into main memory?

Demand paging

  • We can load pages into memory only when they are needed, which is called demand paging
  • Here we see why the valid bit is useful
    • Valid implies “allocated and in memory”
    • Invalid implies “not allocated or not in memory”

Demand paging (cont’d)

  • If process has a page fault, we trap to the OS, which
    1. interrupts the process
    2. finds a free frame (if one exists)
    3. requests the block from the backing store or swap space
    4. brings the page into memory
    5. updates the page table
    6. restarts the process
  • Non-resident pages may have never been in memory (so in the file system) or were once resident and swapped out (to swap space)
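The fault-handling steps above can be sketched as a toy simulation (all structures here are hypothetical stand-ins: dicts for the page table and backing store, a list for the free-frame pool):

```python
def handle_page_fault(page_table, page, free_frames, backing_store):
    """Sketch of the OS's fault handler: find a free frame, read the
    page in from the backing store, update the page table."""
    frame = free_frames.pop()              # 2. find a free frame
    data = backing_store[page]             # 3-4. read the page from disk/swap
    page_table[page] = {'frame': frame,    # 5. update the page table
                        'valid': True,
                        'data': data}
    return frame                           # 6. the process can now restart

backing_store = {7: b'contents of page seven'}
page_table = {}
free_frames = [3, 4, 5]
frame = handle_page_fault(page_table, 7, free_frames, backing_store)
print(frame, page_table[7]['valid'])  # 5 True
```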

Pure demand paging

  • Most extreme scheme
  • In pure demand paging, all pages are non-resident at the start of execution
  • This is not common in practice, as it can lead to a large number of page faults at the start of execution, which can significantly degrade performance

Working set

  • To mitigate the performance issues of pure demand paging, we can use the concept of a working set
  • Takes advantage of the principle of locality: processes tend to access a relatively small set of pages over a short period of time
  • Working-set model:
    • Define a window \(\Delta\) (most recent page references)
    • If a page is referenced within the window, it’s in the working set; otherwise, it’s not
  • Allows for prepaging pages that are likely to be used soon, improving performance by reducing page faults
    • If references fall just outside the current working set, the OS can infer “the working set is expanding in this direction, so let’s prefetch some pages that way”
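The working-set window is straightforward to compute from a reference string (a sketch; the reference string is made up):

```python
def working_set(refs, t, delta):
    """Pages referenced in the window of the last `delta` references
    ending at time t (t indexes into refs)."""
    start = max(0, t - delta + 1)
    return set(refs[start:t + 1])

refs = [1, 2, 1, 3, 1, 2, 4, 4, 4, 5]
print(working_set(refs, 5, 4))  # {1, 2, 3}: pages touched in refs[2..5]
print(working_set(refs, 9, 4))  # {4, 5}: the set shrank as locality shifted
```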

Thrashing

  • If the working set of a process exceeds the available physical memory, the process will experience a high rate of page faults, leading to a condition called thrashing
  • When the page fault time exceeds the time spent executing, we are said to be thrashing
  • Problem: CPU scheduler sees the decreasing CPU utilization and increases the degree of multiprogramming; makes thrashing worse
  • Solutions?
    • Local replacement algorithm (each process can only select victims from its set of allocated frames)
    • i.e. cannot steal frames from another process and cause the latter to thrash as well
    • Decrease the degree of multiprogramming (reduce the number of processes)

Optimizing shared pages

  • Remember how fork() and exec() work?
    • Fork leads to a copy of the parent’s address space
    • … but exec() means many children immediately replace their address space, making the copy wasteful
  • What should we do?
    • Share the pages initially
    • If either the parent or the child writes to the shared page, then you actually make a copy (copy-on-write)
    • Copy-on-write significantly speeds up fork(); particularly if the child will soon exec()

What isn’t clear?

Comments? Thoughts?

Real-World Memory

Questions to consider

  • How can we increase TLB reach without making the TLB itself larger?
  • What memory security features does the OS and hardware provide to protect against attacks?
  • How do real architectures (IA-32, x86-64, ARM) handle multiple page sizes and the transition beyond 32-bit addressing?

TLB reach

  • We’ve talked about the hit ratio of a TLB
    • Influenced by the size of the TLB
  • Another metric: TLB reach (i.e., how much memory is addressable from the TLB)
    • Num TLB entries * page size
    • Ideal world: entire working set for a process fits in the TLB
  • It’d be great to simply grow the TLB, but that’s expensive
  • What else can we do?

Huge pages

  • As we discussed, page tables can be large, and TLBs are small. This can lead to a high TLB miss rate, which can significantly degrade performance.
  • One solution is to use larger page sizes, known as huge pages
  • By using huge pages (e.g., 2MB instead of 4KB [can go up to 1GB]), we can reduce the number of pages, and thus the size of the page table, and increase the TLB hit rate
  • Newer kernels support transparent huge pages, which automatically use huge pages when possible without requiring changes to applications
  • Danger: internal fragmentation can be much worse with huge pages, so they are not ideal for all workloads
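The effect of huge pages on TLB reach is easy to see numerically (entry count of 64 is an illustrative assumption):

```python
def tlb_reach(num_entries, page_size):
    """TLB reach: total memory addressable through the TLB."""
    return num_entries * page_size

KB, MB = 1024, 1024 * 1024
print(tlb_reach(64, 4 * KB) // KB, "KB")  # 256 KB with 4KB pages
print(tlb_reach(64, 2 * MB) // MB, "MB")  # 128 MB with 2MB huge pages
```

Same TLB hardware, 512x the reach; that is the whole appeal of huge pages.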

Linux memory security

  • Linux provides several features to enhance memory security, such as:
    • Address Space Layout Randomization (ASLR): randomizes the memory addresses used by a process, making it more difficult for attackers to predict where code or data is located
    • Kernel Address Space Layout Randomization (KASLR): randomizes the memory addresses used by the kernel
    • Data Execution Prevention (DEP): marks certain areas of memory as non-executable, preventing code from being executed from those regions (e.g., the stack)
      • NX bit: hardware support for DEP held in the page table entry

Intel IA-32 architecture

  • Supports both segmentation and segmentation with paging
    • Each segment can be 4GB
    • Up to 16K segments per process
    • Divided into two partitions
      • First partition of up to 8K segments are private to process (kept in local descriptor table (LDT))
      • Second partition of up to 8K segments shared among all processes (kept in global descriptor table (GDT))

Intel IA-32 architecture (cont’d)

  • CPU generates logical address
    • Selector given to segmentation unit
      • Which produces linear addresses
    • Linear address given to paging unit
      • Which generates physical address in main memory
      • Paging units form equivalent of MMU

Intel IA-32 architecture (cont’d)

  • 4KB and 4MB pages
    • 4KB pages follow what we have talked about
  • How do we do 4MB, and why?
    • 4MB pages are directly pointed to by the page directory (top level in the page table hierarchy)
    • Benefits?
      • Avoids an extra memory access

IA-32 4KB pages (as we’ve seen)

IA-32 4MB pages (only top-level of page table used)

IA-32 architecture (cont’d)

  • Recall that using a classic 32-bit address space we are limited to 4GB addressable space
    • Fine in the 90s, became decidedly not fine long ago
  • How can we fix this?

32-bit Physical Address Extension (PAE)

  • Paging went to a 3-level scheme
  • Page-directory and page-table entries moved to 64-bits in size (used to be 32-bits)
    • The size of the table remains the same, so we go from 1024 4-byte entries to 512 8-byte entries
    • That gives us an extra two bits (offset still 12-bit, 9-bits for top level 512 entries, 9-bits for second level 512 entries, 2-bits left over), used to refer to a page directory pointer table
  • Net effect is increasing total system physical address space to 36 bits (64GB)
  • But, each process is still limited to 4GB of virtual address space (32-bit addresses)

32-bit PAE

PAE (cont’d)

  • Problem solved?
    • For Linux, yes. Problem solved
    • Windows? Of course not, Microsoft decided that you should pay extra for the privilege of using the hardware you own

32-bit PAE question

  • You’re working with a 32-bit IA-32 system that has PAE enabled. The system uses:
    • 2-level paging: PDPT → Page Directory → Page
    • Page size: 2 MB
    • Each page table entry (PTE) is 8 bytes
  • How many entries are in the PDPT, page directory?
  • Break the virtual address into its components (PDPT index, page directory index, page offset)
  • How many total pages can be addressed by this system?
  • What is the maximum addressable physical memory for a single process?
  • What is the maximum addressable physical memory for the entire system?

32-bit PAE question

  • How many entries are in the PDPT, page directory? 4, 512
  • Break the virtual address into its components (PDPT index, page directory index, page offset) 2 bits, 9 bits, 21 bits
  • How many total pages can be addressed by this system? 2048 (4 * 512)
  • What is the maximum addressable physical memory for a single process? 4GB
  • What is the maximum addressable physical memory for the entire system? 64GB
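The answers above can be verified from first principles:

```python
# 2 MB pages -> 21-bit offset; 8-byte entries in a 4 KB table -> 512
# entries -> 9-bit page directory index; 32 - 21 - 9 = 2 bits remain
# for the PDPT index -> 4 PDPT entries.
offset_bits = 21                             # 2 MB = 2^21
pd_entries = 4096 // 8                       # 512 entries per page directory
pd_bits = pd_entries.bit_length() - 1        # 9 bits
pdpt_bits = 32 - offset_bits - pd_bits       # 2 bits
pdpt_entries = 2 ** pdpt_bits                # 4 entries
total_pages = pdpt_entries * pd_entries      # 4 * 512 = 2048 pages
per_process = total_pages * 2**offset_bits   # 2048 * 2 MB = 4 GB

print(pdpt_entries, pd_entries, total_pages, per_process // 2**30)  # 4 512 2048 4
```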

x86-64 architecture

  • 64 bits is enormous (> 16 exabytes)
  • In practice only implement 48 bit addressing (256 TB)
    • Page sizes of 4 KB, 2 MB, 1 GB
    • Four levels of paging hierarchy
  • Can also use PAE-like mechanism (“long mode”) so virtual addresses are 48 bits and physical addresses are 52 bits (4 PB)

ARM32 architecture

  • Dominant mobile platform chip (iOS and Android devices for example)
    • 4 KB and 16 KB pages (changed to 4 KB and 64 KB in ARMv7)
    • 1 MB and 16 MB pages (termed sections)
  • One-level paging for sections, two-level for smaller pages

  • Two levels of TLBs
    • Outer level has two micro TLBs (one data, one instruction)
    • Inner is single main TLB
    • The outer (micro) TLBs are checked first; on a miss the inner main TLB is checked, and on a miss there the CPU performs a page-table walk

What isn’t clear?

Comments? Thoughts?

Kernel memory management

Questions to consider

  • Why does the kernel need physically contiguous memory, and how does this differ from user-space allocation?
  • How does the buddy system handle allocation and deallocation, and what are its trade-offs?
  • Why is a slab allocator useful for small, frequently used kernel objects?

Linux kernel memory management

  • Treated differently from user memory
    • Often allocated from a free-memory pool
    • Kernel requests memory for structures of varying sizes
    • Some kernel memory needs to be contiguous
      • i.e., for device I/O

Linux kernel memory management (cont’d)

  • Main idea: allocate memory from fixed-size segment consisting of physically-contiguous pages
  • Buddy system allocator
    • Memory is allocated in blocks of size \(2^k\) (for some integer \(k\))
    • When a request for memory of size \(s\) is made, the allocator finds the smallest block of size \(2^k\) that can accommodate \(s\)
    • If the block is larger than needed, it is split into two “buddies” of half the size, repeating until the smallest sufficient block is reached
    • When a block is freed, the allocator checks if its buddy is also free; if so, they are coalesced back into a larger block
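The split-and-coalesce logic can be captured in a toy allocator (a sketch of the idea, not the Linux implementation; offsets index into one region of \(2^{max\_order}\) bytes, and a block's buddy is found by flipping one address bit):

```python
class BuddyAllocator:
    """Toy buddy allocator over a 2^max_order byte region. Free blocks
    are tracked per order; the buddy of a block at offset `off` of
    size 2^k lives at offset `off ^ 2^k`."""
    def __init__(self, max_order):
        self.max_order = max_order
        self.free = {k: set() for k in range(max_order + 1)}
        self.free[max_order].add(0)          # one big free block initially

    def alloc(self, size):
        k = (size - 1).bit_length()          # smallest 2^k >= size
        j = k
        while j <= self.max_order and not self.free[j]:
            j += 1                           # find a big-enough free block
        if j > self.max_order:
            return None                      # out of memory
        off = self.free[j].pop()
        while j > k:                         # split into buddies
            j -= 1
            self.free[j].add(off + (1 << j)) # upper half stays free
        return off, k                        # caller remembers the order

    def free_block(self, off, k):
        while k < self.max_order:
            buddy = off ^ (1 << k)
            if buddy not in self.free[k]:
                break                        # buddy in use: stop coalescing
            self.free[k].remove(buddy)       # merge with the free buddy
            off = min(off, buddy)
            k += 1
        self.free[k].add(off)

b = BuddyAllocator(5)        # 32-byte region
a1 = b.alloc(5)              # 5-byte request rounds up to an 8-byte block
print(a1)                    # (0, 3): offset 0, order 3 (8 bytes)
b.free_block(*a1)            # buddies coalesce back into one 32-byte block
print(b.free[5])             # {0}
```

The 5-byte request getting an 8-byte block is exactly the internal fragmentation the next slide warns about.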

Buddy system allocator (cont’d)

  • Advantages?
    • Simple and efficient for allocating memory of varying sizes
    • Can quickly coalesce free blocks
  • Disadvantages?
    • Fragmentation
      • Internal fragmentation: if a request is for 5KB, we need to allocate an 8KB block, wasting 3KB
      • External fragmentation: over time, as blocks are allocated and freed, we can end up with many small free blocks that cannot be coalesced into larger blocks, making it difficult to satisfy larger allocation requests
    • Not ideal for all workloads, particularly those with many small allocations

Slab allocator

  • Linux also uses a slab allocator for small objects
  • Slab allocator maintains caches of pre-allocated memory chunks for frequently used, fixed-size objects (e.g., file descriptors, process control blocks)
  • When an object is needed, the allocator can quickly provide a chunk from the appropriate cache, reducing fragmentation and improving performance for small allocations
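The core idea, stripped of all kernel detail, is just a recycled free list of preconstructed objects (a hypothetical sketch, not the Linux slab code):

```python
class SlabCache:
    """Minimal object cache: preallocate fixed-size objects and recycle
    them on a free list instead of allocating/freeing each time."""
    def __init__(self, factory, prealloc=8):
        self.factory = factory
        self.free_list = [factory() for _ in range(prealloc)]

    def get(self):
        # Fast path: pop a ready-made object; slow path: construct one
        return self.free_list.pop() if self.free_list else self.factory()

    def put(self, obj):
        self.free_list.append(obj)   # return the object for reuse

cache = SlabCache(dict, prealloc=4)  # e.g., a cache of empty dicts
obj = cache.get()                    # served from the preallocated pool
cache.put(obj)                       # recycled, not released
```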

What isn’t clear?

Comments? Thoughts?

Midterm 1 boundary